Abstract

Many real-world complex networks are best modeled as bipartite (or 2-mode) graphs, where
nodes are divided into two sets with links connecting one side to the other. However, there is
currently a lack of methods to properly analyze such graphs, as most existing measures and methods are
suited to classical graphs. A usual but limited approach consists in deriving 1-mode graphs (called
projections) from the underlying bipartite structure, though this causes significant loss of information
and data storage issues. We introduce here internal links and pairs as a new notion useful for such
analysis: it gives insights into the information lost by projecting the bipartite graph. We illustrate the
relevance of these concepts on several real-world instances, showing how they make it possible to
discriminate behaviors among various cases when compared to a benchmark of random networks.
Then, we show that this concept is beneficial both for modeling complex networks and for
storing them in a compact format.

1 Introduction 

Many real-world networks have a natural bipartite (or 2-mode) structure and so are best modeled by
bipartite graphs: two kinds of nodes coexist and links are between nodes of different kinds only. Typical
examples include biological networks in which proteins are involved in biochemical reactions, occurrences
of words in sentences of a book, authoring of scientific papers, file-provider graphs where each
file is connected to the individuals providing it, and many social networks where people are members of
groups like directory boards. See [Newman et al. 2001, Latapy et al. 2008] for more examples.



The classical approach for studying such graphs is to turn them into classical (non-bipartite) graphs
using the notion of projection: considering only one of the two types of nodes and linking any two nodes
if they share a neighbor in the bipartite graph. This leads for instance to cooccurrence graphs, where two
words are linked if they appear in the same sentence, coauthoring graphs, where two researchers are linked
if they are authors of the same paper, interest graphs, where individuals are linked together if they provide
the same file, etc.



This approach however has severe drawbacks [Latapy et al. 2008]. In particular, it leads to huge
projected graphs, and much information is lost in the projection. There is therefore much interest in methods
that would make it possible to study bipartite graphs directly, without resorting to projection. Despite
previous efforts to develop such methods [Lind et al. 2005, Latapy et al. 2008, Zweig et al. 2011], much
remains to be done in this direction.

We propose in this paper a new notion, namely internal links and pairs, useful for the analysis of
real-world bipartite graphs. We introduce it in Section 2, then present in Section 3 some datasets that we
use as typical real-world cases, which we analyze in Section 4 with regard to our new notion. We explore
a more algorithmic perspective in Section 5.
2 Internal pairs and links



Let us consider a bipartite graph $G = (\bot, \top, E)$, with $E \subseteq \bot \times \top$. We call nodes in $\bot$ (resp. $\top$) the
bottom (resp. top) nodes. We denote by $N(u) = \{v \in (\bot \cup \top), (u,v) \in E\}$ the neighborhood of any node
$u$. We extend this notation to any set $S$ of nodes as follows: $N(S) = \bigcup_{v \in S} N(v)$.

The $\bot$-projection of $G$ is the graph $G_\bot = (\bot, E_\bot)$ in which $(u,v) \in E_\bot$ if $u$ and $v$ have at least one
neighbor in common (in $G$): $N(u) \cap N(v) \neq \emptyset$. We will denote by $N_\bot(u)$ the neighborhood of a node $u$ in
$G_\bot$: $N_\bot(u) = \{v \in \bot, (u,v) \in E_\bot\} = N(N(u))$. The $\top$-projection $G_\top$ is defined dually.

For any pair of nodes $(u,v) \notin E$, we denote by $G + (u,v)$ the graph $G' = (\bot, \top, E \cup \{(u,v)\})$ obtained
by adding the new link $(u,v)$ to $G$. For any link $(u,v) \in E$, we denote by $G - (u,v)$ the graph
$G' = (\bot, \top, E \setminus \{(u,v)\})$ obtained by removing link $(u,v)$ from $G$.

Definition 1 (internal pairs) A pair of nodes $(u,v)$ with $(u,v) \notin E$ is a $\bot$-internal pair of $G$ if the
$\bot$-projection of $G' = G + (u,v)$ is identical to the one of $G$. We define $\top$-internal pairs dually.
Figure 1: Example of $\bot$-internal pair. Left to right: a bipartite graph $G$, the bipartite graph $G'$ obtained
by adding link $(B,l)$ to $G$, and the $\bot$-projection of these two graphs. As $G'_\bot = G_\bot$, $(B,l)$ is a $\bot$-internal
pair of $G$.



Definition 2 (internal links) A link $(u,v) \in E$ is a $\bot$-internal link of $G$ if the $\bot$-projection of
$G' = G - (u,v)$ is identical to the one of $G$. We define $\top$-internal links dually.






Figure 2: Example of $\bot$-internal link. Left to right: a bipartite graph $G$, the bipartite graph $G'$ obtained
by removing link $(B,j)$ from $G$, and the $\bot$-projection of these two graphs. As $G'_\bot = G_\bot$, $(B,j)$ is a
$\bot$-internal link of $G$.

In other words, $(u,v)$ is a $\bot$-internal pair of $G$ if adding the new link $(u,v)$ to $G$ does not change its
$\bot$-projection; it is a $\bot$-internal link if removing link $(u,v)$ from $G$ does not change its $\bot$-projection. See
Figures 1 and 2 for examples.
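The two definitions can be checked directly by recomputing the projection, as in the following minimal Python sketch. The edge-set representation, the function names and the toy graph are assumptions made for illustration (the toy graph is not one of the paper's figures); this brute-force test recomputes the whole projection and is only meant for small examples.

```python
from itertools import combinations

def bottom_projection(edges):
    """Links of the bottom-projection: pairs of bottom nodes sharing a top neighbor."""
    members = {}  # top node -> set of its bottom neighbors
    for u, v in edges:
        members.setdefault(v, set()).add(u)
    return {frozenset(pair)
            for group in members.values()
            for pair in combinations(group, 2)}

def is_internal_pair(edges, u, v):
    """Definition 1: adding the absent link (u, v) leaves the projection unchanged."""
    assert (u, v) not in edges
    return bottom_projection(edges | {(u, v)}) == bottom_projection(edges)

def is_internal_link(edges, u, v):
    """Definition 2: removing the existing link (u, v) leaves the projection unchanged."""
    assert (u, v) in edges
    return bottom_projection(edges - {(u, v)}) == bottom_projection(edges)

# Hypothetical toy graph: links are (bottom node, top node) tuples.
edges = {("A", "i"), ("B", "i"), ("A", "j"), ("B", "j"), ("C", "j"), ("D", "k")}

print(is_internal_link(edges, "A", "i"))  # True: A and B still share j
print(is_internal_link(edges, "C", "j"))  # False: C loses its links to A and B
print(is_internal_pair(edges, "C", "i"))  # True: C is already linked to A and B
print(is_internal_pair(edges, "A", "k"))  # False: creates the new link (A, D)
```

The projection is compared as a set of links only, since the node set of the projection is the same before and after the change.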



The notion of internal link is related to the redundancy of a node [Latapy et al. 2008], defined for
any node $v$ as the fraction of pairs in $N(v)$ that are still linked together in the projection of the graph $G'$
obtained from $G$ by removing $v$ and all its links (all these pairs are linked in $G_\bot$). There is however no
direct equivalence between the two notions. The redundancy is a node-oriented property: it gives a value
for each node, while the notion of internal links and pairs is link-oriented. As illustrated in Figure 3,
nodes exhibiting the same fraction of internal links may have different redundancies, and conversely
two nodes having the same redundancy may correspond to different internal connectivity patterns. It is
possible to classify the links of each node as $\bot$- and $\top$-internal or not; this induces a notion of $\bot$-internal
degree of a node (resp. $\top$-internal degree), which is its number of internal links (see next section).
Figure 3: Redundancy versus internal links. In this graph, B and D have the same fraction of $\bot$-internal
links while having different redundancies.

We now give a characterization of internal links, which does not explicitly rely on the projection 
anymore and provides another point of view on this notion. 

Lemma 1 A link $(u,v)$ of $G$ is $\bot$-internal if and only if $N(v) \setminus \{u\} \subseteq N(N(u) \setminus \{v\})$.

Proof: Let us consider a link $(u,v) \in E$ and let $G' = G - (u,v)$ be the bipartite graph obtained by
removing the link $(u,v)$ from $G$. Then, by definition, $E_\bot = E'_\bot \cup \{(u,x), x \in N(v) \setminus \{u\}\}$.

Suppose that $(u,v)$ is a $\bot$-internal link, i.e. $E_\bot = E'_\bot$. Then all links $(u,x)$ in the expression above
already belong to $E'_\bot$. Therefore, for each $x \in N(v) \setminus \{u\}$, $\exists y \neq v \in \top$ such that $y \in N(u) \cap N(x)$. By
symmetry, $x \in N(y)$ and $y \in N(u) \setminus \{v\}$; therefore, $x \in N(N(u) \setminus \{v\})$ and so $N(v) \setminus \{u\} \subseteq N(N(u) \setminus \{v\})$.

Suppose now that $N(v) \setminus \{u\} \subseteq N(N(u) \setminus \{v\})$. Then for each node $x \in N(v) \setminus \{u\}$, $\exists y \in N(u) \setminus \{v\}$
such that $x \in N(y)$. Thus, by definition of the projection, $(u,x) \in E'_\bot$. Therefore $E_\bot = E'_\bot$ and the link
$(u,v)$ is $\bot$-internal. □
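Lemma 1 gives a purely local test that avoids recomputing the projection. The sketch below is one possible Python rendering of it, cross-checked against the definition on a small graph. The adjacency-dictionary representation, the function names and the toy graph are illustrative assumptions, and bottom and top nodes are assumed to carry distinct names.

```python
def build_adjacency(edges):
    """Map every node (bottom or top) to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_internal_link_lemma(adj, u, v):
    """Lemma 1: N(v) minus {u} must be included in N(N(u) minus {v})."""
    reachable = set()
    for y in adj[u] - {v}:      # top neighbors of u other than v
        reachable |= adj[y]     # bottom nodes that u reaches without using v
    return (adj[v] - {u}) <= reachable

def projection(edges):
    """Bottom-projection links, used only to cross-check the lemma."""
    adj = build_adjacency(edges)
    return {frozenset((a, b))
            for _, t in edges for a in adj[t] for b in adj[t] if a != b}

# Hypothetical toy graph: links are (bottom node, top node) tuples.
edges = {("A", "i"), ("B", "i"), ("A", "j"), ("B", "j"), ("C", "j"), ("D", "k")}
adj = build_adjacency(edges)

for u, v in sorted(edges):
    by_lemma = is_internal_link_lemma(adj, u, v)
    by_definition = projection(edges - {(u, v)}) == projection(edges)
    print((u, v), by_lemma, by_definition)
```

On this graph the two tests agree on every link; in particular the link (D, k), attached to a top node of degree 1, is internal because the left-hand side of the inclusion is empty.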



3 Datasets 

Our aim is to evaluate the importance of internal pairs and links in large real-world graphs, rather than 
obtain specific conclusions in a particular context. That is why we study various instances of real-world 
bipartite graphs, expecting to observe different behaviors. We present in this section the datasets we will 
use and summarize their general features (number of nodes and links). The graphs under consideration
are social ones connecting people ($\bot$-nodes) through events, groups or similar interests ($\top$-nodes).

• Imdb-movies [Barabási et al. 1999] is obtained from the Internet Movie Database (www.imdb.com):
it features actors connected to the movies they played in. $|\bot|$ = 127,823 actors, $|\top|$ = 383,640
movies, $|E|$ = 1,470,418.

• Delicious-tags [Görlitz et al. 2008] consists of Delicious (www.delicious.com) users connected
to the tags they use for indexing their bookmarks. $|\bot|$ = 532,924 users, $|\top|$ = 2,474,234 tags,
$|E|$ = 37,421,585.

• Flickr-tags [Prieur et al. 2008] consists of Flickr (www.flickr.com) users connected to the tags
they use for indexing their photos. $|\bot|$ = 319,675 users, $|\top|$ = 1,607,879 tags, $|E|$ = 13,336,993.

• Flickr-comments: same as above, except that Flickr users are linked to the photos they comment on.
$|\bot|$ = 760,261 users, $|\top|$ = 12,678,244 photos, $|E|$ = 41,904,158.

• Flickr-favorites: same as above, except that users are linked to the photos they pick as favorites.
$|\bot|$ = 321,312 users, $|\top|$ = 6,450,934 photos, $|E|$ = 17,871,828.

• Flickr-groups: same as above, except that users are linked to the groups they belong to. $|\bot|$ = 72,875
users, $|\top|$ = 381,076 groups, $|E|$ = 5,662,295.

• P2P-files [Aidouni et al. 2009] is obtained from the peer-to-peer file exchange system eDonkey: users are
linked to the files they provide. $|\bot|$ = 122,599 peers, $|\top|$ = 1,920,353 files, $|E|$ = 4,502,704.

• PRL-papers has been extracted from the Web of Science database (www.isiwebofknowledge.com),
collecting papers and authors of Physical Review Letters from 2004 to 2007. $|\bot|$ = 15,413 authors,
$|\top|$ = 41,633 papers, $|E|$ = 249,474.

4 Analysis of real-world cases 

In this section, we use the notions of internal links and pairs introduced in Section 2 to describe the
real-world cases presented in Section 3. Let us insist on the fact that our aim is not to provide accurate
information on these specific cases, but to illustrate how internal links and pairs may be used to analyze
real-world data. We first show that there are many internal links in typical data, then study the number
of internal links of each node and the correlation of this number with the node's degree.

Since the links attached to $\top$-nodes (resp. $\bot$-nodes) of degree 1 are all $\bot$-internal (resp. $\top$-internal),
and since there may be a large fraction of nodes with degree 1 in real-world graphs, we only study in the
sequel links attached to nodes with degree at least 2.

4.1 Amount of internal links and pairs 

In order to capture how redundant the bipartite structure is, we compute the number of $\top$- and $\bot$-internal
pairs and links. The fraction of internal links, denoted $f_{E_I}$ and presented in Table 1, seems in general
not negligible. A quantitative analysis of these values however requires the definition of a benchmark.
That is why we compare the measures to the corresponding amounts on random bipartite graphs with the
same sizes and degree distributions, which is a typical random model to evaluate the deviation from an
expected behavior; see for example [Newman et al. 2001, Newman et al. 2003]. The measures related
to this model will be referred to with the symbol *.

We denote by $\mathcal{P}_I(\bot)$ (resp. $\mathcal{P}_I(\top)$) the set of $\bot$-internal pairs (resp. $\top$-internal pairs) and by $E_I(\bot)$
(resp. $E_I(\top)$) the set of $\bot$-internal links (resp. $\top$-internal links). We normalize the number of internal
pairs and links measured on real graphs to the values obtained with the model described above. The
corresponding results are also presented in Table 1.
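A benchmark graph with the same sizes and degree sequences can be drawn in several ways; the paper does not detail its generator. The sketch below assumes a simple stub-matching scheme: each link contributes one bottom stub and one top stub, and the top stubs are shuffled before being re-paired. The function name and toy graph are hypothetical.

```python
import random

def random_bipartite_same_degrees(edges, seed=None):
    """Rewire a bipartite graph by re-pairing link endpoints at random.

    Assumption: duplicate pairs produced by the matching are collapsed,
    so a few nodes may end up with a slightly lower degree than in the input.
    """
    rng = random.Random(seed)
    bottom_stubs = [u for u, _ in edges]   # u appears once per link, i.e. deg(u) times
    top_stubs = [v for _, v in edges]      # v appears deg(v) times
    rng.shuffle(top_stubs)
    return set(zip(bottom_stubs, top_stubs))

# Hypothetical toy graph: links are (bottom node, top node) tuples.
edges = {("A", "i"), ("B", "i"), ("A", "j"), ("B", "j"), ("C", "j"), ("D", "k")}
benchmark = random_bipartite_same_degrees(edges, seed=0)
print(len(edges), len(benchmark))
```

A normalized value such as those in Table 1 would then be obtained by dividing a count measured on the real graph by the same count measured on such a benchmark (in practice, presumably averaged over several draws).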
Dataset           $f_{E_I}(\bot)$   $|\mathcal{P}_I(\bot)|/|\mathcal{P}_I^*(\bot)|$   $|E_I(\bot)|/|E_I^*(\bot)|$   $f_{E_I}(\top)$   $|\mathcal{P}_I(\top)|/|\mathcal{P}_I^*(\top)|$   $|E_I(\top)|/|E_I^*(\top)|$
Imdb-movies       0.031             0.441                                             47.0                          0.026             0.491                                             147
Delicious-tags    0.112             0.972                                             1.47                          0.104             1.823                                             5.31
Flickr-tags       0.117             0.920                                             1.51                          0.048             1.040                                             2.50
Flickr-comments   0.398             0.258                                             4.22                          0.002             0.151                                             22.0
Flickr-groups     0.228             0.491                                             2.21                          0.015             0.249                                             2.86
Flickr-favorites  0.172             0.574                                             2.02                          0.002             0.704                                             12.4
P2P-files         0.337             0.082                                             8.53                          0.136             0.092                                             1430
PRL-papers        0.718             0.033                                             7.17                          0.487             0.001                                             11.2

Table 1: Fraction of internal links ($f_{E_I}$), number of internal pairs ($|\mathcal{P}_I|$) and internal links ($|E_I|$) of
real-world graphs normalized to the values on random bipartite graphs with the same size and same degree
distributions.
We first notice that the behaviors with regard to the amount of internal links are very heterogeneous.
Still, some general trends can be underlined: the random model underestimates the numbers of $\bot$- and
$\top$-internal links. So, the probability of having nodes sharing the same neighborhood is higher in real graphs than
in random ones. We may indeed expect, for instance, that people participating in the same paper have a
higher probability to be coauthors of another one than a random pair of authors.

Meanwhile, the numbers of internal pairs are generally overestimated in random networks. To understand
this effect, let us consider the extreme case where any two $\bot$-nodes in a graph have either exactly the
same neighborhood, or no common neighbors. Then all links are $\bot$-internal, and the graph does not
contain any internal pair. This example suggests that the number of internal pairs is probably anti-correlated
with the number of internal links.

In general, there is a correlation between the fact that the number of internal links is underestimated 
in random graphs and the fact that the number of internal pairs is overestimated, but this correlation does 
not hold in all cases. Moreover, there is no direct link between these observations and the sizes or average 
degrees of the considered graphs. 

Finally, we observe a specific behavior for the two graphs which correspond to tagging databases, i.e.
Delicious-tags and Flickr-tags. For these graphs we observe the lowest gaps between the real and random
cases for $\bot$-internal links, and the amounts of $\bot$-internal pairs are very close in the real and random
cases. Conversely, they are the only graphs for which the amount of $\top$-internal pairs is underestimated
in random graphs.

Since we can observe a wide range of behaviors both for $\top$- and $\bot$-internal links and pairs, we will
restrict our analysis in the following to $\bot$-internal links and pairs for the sake of brevity. We will see
that this allows enlightening observations.

4.2 Distribution of internal links among nodes 

The notion of internal links partitions the links of each node into two sets: the internal ones and the
others. We now study how the fraction of internal links is distributed among nodes. In Figure 4, we
plot the complementary cumulative distribution of the fraction of internal links per node for the datasets
under study. We also plot the complementary cumulative distribution for random graphs.
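The quantity plotted can be computed as in the following sketch, which relies on the characterization of Lemma 1. Two choices here are assumptions: the restriction to nodes of degree at least 2 is read as "ignore links whose top endpoint has degree 1", and the complementary cumulative distribution is taken as the fraction of nodes whose value is at least x. Function names and the toy graph are hypothetical.

```python
def build_adjacency(edges):
    """Map every node (bottom or top) to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_internal_link(adj, u, v):
    """Lemma 1: N(v) minus {u} must be included in N(N(u) minus {v})."""
    reachable = set()
    for y in adj[u] - {v}:
        reachable |= adj[y]
    return (adj[v] - {u}) <= reachable

def internal_fraction_per_node(edges):
    """Fraction of internal links among the considered links of each bottom node."""
    adj = build_adjacency(edges)
    total, internal = {}, {}
    for u, v in edges:
        if len(adj[v]) < 2:
            continue  # assumption: links to top nodes of degree 1 are skipped
        total[u] = total.get(u, 0) + 1
        internal[u] = internal.get(u, 0) + is_internal_link(adj, u, v)
    return {u: internal[u] / total[u] for u in total}

def ccdf(values):
    """Pairs (x, fraction of values >= x) for each distinct value x."""
    n = len(values)
    return [(x, sum(val >= x for val in values) / n) for x in sorted(set(values))]

# Hypothetical toy graph: links are (bottom node, top node) tuples.
edges = {("A", "i"), ("B", "i"), ("A", "j"), ("B", "j"), ("C", "j"), ("D", "k")}
fractions = internal_fraction_per_node(edges)
print(fractions)                        # A and B: 0.5, C: 0.0, D has no considered link
print(ccdf(list(fractions.values())))
```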
Figure 4: Complementary cumulative distribution of the fraction of internal links per node.

One of the most noticeable differences between both curves lies in the probability of having a node
whose links are all internal (x = 1): this fraction is indeed much higher in real than in random graphs.
We also observe that real graphs exhibit fewer nodes with very low (or null) fractions of internal links
(though the fraction of nodes with no internal link is high in both cases). In this respect too, the datasets
behave differently: for Imdb-movies the probability of having a $10^{-2}$ fraction of internal links is more
than one order of magnitude larger in the random than in the real graph, while the Flickr-tags curves are
nearly superimposed at low fractions. Notice that this is not directly related to the fact that the
number of internal links is underestimated or not in random graphs: for Delicious-tags the ratio between
the number of $\bot$-internal links in the real and in the random case is smaller than for Flickr-tags, but the
difference between the distributions of the fraction of internal links per node is larger for Delicious-tags
than for Flickr-tags.

Finally, the very low fractions that we observe are associated with nodes with high degree: to have a
$10^{-4}$ fraction of internal links, a node has to have a degree of at least $10^4$. Therefore, we study in the
following the correlation between the degree of a node and its number of internal links.

4.3 Correlation of internal links with node degrees 

As stated before, the number of ($\bot$-)internal links of a node is called its ($\bot$-)internal degree, its total
number of links being its degree. We investigate in this section the relationship between both quantities,
plotting in Figure 5 the average degree of a node as a function of its ($\bot$-)internal degree for the real datasets
and the randomized ones.

[Figure 5 panels: Imdb-movies, Delicious-tags, Flickr-tags, Flickr-comments, Flickr-groups, Flickr-favorites,
P2P-files, PRL-papers.]

Figure 5: Average degree as a function of the internal degree (for users projection).



We observe that in several cases both real and random curves can be approximated by a sub-linear
law over several decades. However, this model is unsatisfactory on the P2P-files dataset, and questionable
in cases where the values are too rare or too scattered: most noticeably Imdb-movies and Flickr-groups.
The dispersion observed at large degrees is a consequence of the heterogeneous degree distribution, the
number of nodes with high degree being low.

If the fact that a given link is internal or not were independent of the node's degree, these curves
would be linear. As random graphs have a sublinear behavior, this means that nodes with large degrees
have on average a higher fraction of internal links. This effect can be explained qualitatively: increasing
the degree of a node $u$, everything else being unchanged, increases the probability that
one of its neighbors $v$ is such that $N(v) \setminus \{u\} \subseteq N(N(u) \setminus \{v\})$.
On the other hand, the slope for real graphs is in most cases larger than for the random ones; again,
tagging datasets exhibit a different behavior. So there is an additional effect leading high degree nodes
to have not as high an internal degree as expected by considering only the degree distributions. This is
consistent with previous observations: the real case provides more internal links and fewer nodes with a
low (but not null) fraction of such links, which must be high degree nodes. This stems from the fact that
if nodes $u$ and $v$ are neighbors, the probability that $N(v) \setminus \{u\} \subseteq N(N(u) \setminus \{v\})$ is all the higher
if $v$ has a small degree and $u$ a large one. Therefore we expect that degree-correlated graphs yield larger
slopes than degree-anticorrelated ones. Yet, a more quantitative understanding of these phenomena calls
for a study of the degree correlations in real-world graphs.

5 Removing internal links 

When modeling complex networks using bipartite graphs [Newman et al. 2001, Guillaume et al. 2004],
the presence of internal links may be a problem as they are poorly captured by models. In this regard,
removing internal links before generating a random bipartite graph may lead to better models.
Moreover, internal links are precisely those links in a bipartite graph which may be removed without
changing the projection. As the bipartite graph may be seen as a compact encoding of its projection
[Latapy et al. 2008], one then obtains an even more compressed encoding. Considering the example of
the P2P-files dataset, it requires 30 MB if stored as a usual 2-mode table of lists, while the corresponding
$\bot$-projection (i.e. users) requires 213 MB and the $\top$-projection 4.6 GB if stored as tables of edges.

However, removing internal links is not trivial, as removing one specific link $(u,v)$ may change the
nature of other links: while they were internal in the initial graph, they may not be internal anymore after
the removal of $(u,v)$. See Figure 6 for an example. Therefore, in order to obtain a bipartite graph with
no internal link but still the same projection (and so a minimal graph in this regard), it is not possible in
general to delete all initial internal links since this would dramatically alter the structure of the projection.
The set of internal links must therefore be updated after each removal. Going further, there may exist
removal strategies which maximize the number of removals, whereas others may minimize it.




[Figure 6 panels, left to right: $G$; $G' = G - (A,i)$; $G'_\bot = G_\bot$.]

Figure 6: Influence of the deletion process on internal links. $\{(A,i), (B,j), (C,k), (D,l)\}$ are $\bot$-internal
links of $G$, yet deleting $(A,i)$ leads to $G'$ where $\{(B,j), (C,k), (D,l)\}$ are no longer $\bot$-internal links, as
they are the only links in $G'$ ensuring that A is connected to respectively B, C and D in $G_\bot$.

To explore these questions, let us consider a random removal process, where each step consists in
choosing an internal link at random and removing it, and we iterate such steps until no internal link
remains. Figure 7 presents the number of remaining internal links as a function of the number of internal
links removed for typical cases. We also plot the upper bound $|E_I| - x$ (where $x$ denotes the number of link
removals), which represents the hypothetical case where all links initially internal remain internal during
the whole process.
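The random removal process can be sketched as follows. This is a naive illustration that recomputes the whole set of internal links after every removal, using the characterization of Lemma 1; it is not meant to reflect the implementation used for the experiments. The function names are hypothetical, and the example graph is a guess built to match the caption of Figure 6, not the figure itself.

```python
import random

def build_adjacency(edges):
    """Map every node (bottom or top) to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_internal_link(adj, u, v):
    """Lemma 1: N(v) minus {u} must be included in N(N(u) minus {v})."""
    reachable = set()
    for y in adj[u] - {v}:
        reachable |= adj[y]
    return (adj[v] - {u}) <= reachable

def projection(edges):
    """Bottom-projection links, used to check that pruning preserves them."""
    adj = build_adjacency(edges)
    return {frozenset((a, b))
            for _, t in edges for a in adj[t] for b in adj[t] if a != b}

def random_removal(edges, seed=None):
    """Remove random internal links until none remains.

    Returns the pruned link set and history, where history[x] is the
    number of internal links left after x removals.
    """
    rng = random.Random(seed)
    adj = build_adjacency(edges)
    remaining = set(edges)
    history = []
    while True:
        internal = sorted(e for e in remaining if is_internal_link(adj, *e))
        history.append(len(internal))
        if not internal:
            return remaining, history
        u, v = rng.choice(internal)
        remaining.remove((u, v))
        adj[u].remove(v)
        adj[v].remove(u)

# Hypothetical graph consistent with the caption of Figure 6.
edges = {("A", "i"), ("B", "i"), ("C", "i"), ("D", "i"),
         ("A", "j"), ("B", "j"), ("A", "k"), ("C", "k"),
         ("A", "l"), ("D", "l")}
pruned, history = random_removal(edges, seed=0)
print(len(edges), len(pruned), history)
print(projection(pruned) == projection(edges))  # True: the projection is preserved
```

The list `history` corresponds to the curve of Figure 7, and `history[0] - x` to the upper bound. On this example the outcome depends strongly on the first draw: removing (A, i) first makes every other link non-internal.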

This random deletion process leads to a pruned bipartite graph, containing the information of the 1-mode
graph. Going back to the example of the P2P dataset, the obtained 2-mode storage graph requires
12 MB for the related $\bot$-projection and 22 MB for the $\top$ one, thus enabling a compression to 0.40 (resp.
0.73) when compared to the standard 30 MB bipartite representation of the graph, which is itself a
compact encoding of the projections.

Figure 7: Number of internal links remaining as a function of the number of deletions. Red thick line:
random deletion process, blue thin line: theoretical upper bound.

To go further, one may seek strategies that remove as many internal links as possible, for instance 
using a greedy algorithm selecting at each step the internal link leading to the lowest decrease of the 
number of remaining internal links. This is however out of the scope of this paper. 
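For illustration only, the greedy idea mentioned above could look like the following sketch; it was not evaluated in this paper. At each step it tries every internal link, counts how many internal links would remain after its removal, and removes the one leaving the most. All names are hypothetical, the cost per step is quadratic in the number of links, and the example graph is a guess consistent with the caption of Figure 6.

```python
def build_adjacency(edges):
    """Map every node (bottom or top) to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_internal_link(adj, u, v):
    """Lemma 1: N(v) minus {u} must be included in N(N(u) minus {v})."""
    reachable = set()
    for y in adj[u] - {v}:
        reachable |= adj[y]
    return (adj[v] - {u}) <= reachable

def internal_links(adj, links):
    """Sorted list of the links that are currently internal."""
    return sorted(e for e in links if is_internal_link(adj, *e))

def greedy_removal(edges):
    """Repeatedly remove the internal link that leaves the most internal links."""
    adj = build_adjacency(edges)
    remaining = set(edges)
    removed = []
    while True:
        candidates = internal_links(adj, remaining)
        if not candidates:
            return remaining, removed

        def score(link):
            # Temporarily remove the link, count what stays internal, then restore it.
            u, v = link
            adj[u].discard(v)
            adj[v].discard(u)
            count = len(internal_links(adj, remaining - {link}))
            adj[u].add(v)
            adj[v].add(u)
            return count

        best = max(candidates, key=score)  # ties: first link in sorted order
        remaining.remove(best)
        adj[best[0]].discard(best[1])
        adj[best[1]].discard(best[0])
        removed.append(best)

# Hypothetical graph consistent with the caption of Figure 6.
edges = {("A", "i"), ("B", "i"), ("C", "i"), ("D", "i"),
         ("A", "j"), ("B", "j"), ("A", "k"), ("C", "k"),
         ("A", "l"), ("D", "l")}
pruned, removed = greedy_removal(edges)
print(sorted(pruned))  # only the four links attached to top node i remain
print(len(removed))    # 6
```

On this example the greedy choice never removes (A, i), whose removal would leave no internal link at all, and therefore prunes six of the ten links.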



6 Conclusion 

We introduced the notion of internal links and pairs in bipartite graphs, and proposed it as an important
notion for analyzing real-world 2-mode complex networks. Using a wide set of real-world examples, we
observed that internal links are very frequent in practice, and that associated statistics are fruitful
measures to point out similarities and differences among real-world networks. This makes them a relevant
tool for the analysis of bipartite graphs, which is an important research topic. Moreover, removing internal
links may be used to compact bipartite encodings of graphs and to improve their modeling.

We provided a first step towards the use and understanding of internal links and pairs. Further 
investigations could bring us more precise information about the role of internal links, in particular 
regarding the dynamics. We suspect for instance that internal pairs may become internal links with high 
probability in future evolution of the graph. One may also study those links (and pairs) which are both
$\bot$- and $\top$-internal, as they may have a special importance in a graph.